## 'data.frame': 4898 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## $ quality_factor : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
##
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
##
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
##
## alcohol quality quality_factor
## Min. : 8.00 Min. :3.000 3: 20
## 1st Qu.: 9.50 1st Qu.:5.000 4: 163
## Median :10.40 Median :6.000 5:1457
## Mean :10.51 Mean :5.878 6:2198
## 3rd Qu.:11.40 3rd Qu.:6.000 7: 880
## Max. :14.20 Max. :9.000 8: 175
## 9: 5
The data set contains 4898 observations of 14 variables (including the row index X). As the quality score is an integer grade on a scale from 0 to 10, it makes sense to also treat it as a categorical variable. To do so, I added another variable, quality_factor, which stores the score as a factor. Some of the variables seem to have extreme outliers, which we should take into account when creating plots for them.
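A minimal sketch of how such a factor column might be added. The data frame name `wine` is an assumption, and the ten scores are only the first ten values shown in the str() output above, not the full data set.

```r
# Illustrative subset: the first ten quality scores from the str() output.
# The data frame name `wine` is hypothetical.
wine <- data.frame(quality = c(6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L))

# Keep the integer score and add a categorical copy of it.
# Listing the grades 3 to 9 explicitly keeps all seven levels,
# even when a subset does not contain every grade.
wine$quality_factor <- factor(wine$quality, levels = 3:9)

str(wine)
table(wine$quality_factor)
```

On the full data set, `factor(wine$quality)` alone would give the same seven levels, because every grade from 3 to 9 occurs at least once.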
To get an overview of the 13 variables (all except the row index X), creating a grid with distribution histograms seems to be the best way to start.
The output looks good, so I am now creating plots for each variable. To ignore the outliers, I will set the limit of the axis to the 99% quantile where necessary.
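The 99% quantile limit could be sketched as follows. The values are synthetic and only stand in for a right-skewed wine variable, and base R is used so that the example runs without extra packages; the plotting code of this report itself is not shown here.

```r
set.seed(42)

# Synthetic, right-skewed stand-in for a wine variable (NOT the real data):
# 990 ordinary values plus 10 extreme ones.
x <- c(rgamma(990, shape = 2, rate = 0.4), runif(10, min = 40, max = 65))

# Upper limit: the value below which 99% of the observations fall.
upper <- quantile(x, probs = 0.99)

# Keep only the values up to that limit, so the bins are computed
# on the data that is actually displayed.
kept <- x[x <= upper]

hist(kept, breaks = 30, xlim = c(0, upper),
     main = "Values up to the 99% quantile", xlab = "x")
```

Filtering before plotting is a design choice: passing only `xlim` to `hist()` would zoom the axis but still compute the bins over the full range, including the outliers.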
These three plots show the distributions of the three acidity variables in the data set. The maximum value of citric.acid is 1.66 g / dm^3, which is more than 5 times the median (0.32 g / dm^3). For that reason, I will set the limit of the x-axis to 0.85. I will add another plot showing the log-transformed values for citric.acid:
The distribution of sugar looks skewed in the grid above, so I will choose a different bin width and set a limit. The summary of the data shows that the minimum is 0.60 and the maximum is 65.80 g/dm^3. However, 75% of the values are at or below 9.90 g/dm^3, far below the maximum.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
For this variable, we can also use a log10 scale to get a better view of the distribution:
Interestingly enough, in this histogram we can see two peaks, around 1.5 and 10.
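A hedged sketch of a log10 histogram. The data are synthetic, generated as a mixture of two groups centred near 1.5 and 10 purely to mimic the two peaks described above; the variable name `sugar` is an assumption.

```r
set.seed(1)

# Synthetic stand-in for residual sugar (NOT the real data):
# two log-normal groups centred near 1.5 and 10 g/dm^3.
sugar <- c(rlnorm(600, meanlog = log(1.5), sdlog = 0.25),
           rlnorm(400, meanlog = log(10),  sdlog = 0.35))

# On the raw scale the second group is smeared into a long right tail;
# on the log10 scale both groups appear as separate peaks.
h <- hist(log10(sugar), breaks = 40,
          main = "Residual sugar on a log10 scale",
          xlab = "log10(residual sugar)")
```

Note that `log10(0)` is `-Inf`, so a variable that contains zeros (citric.acid has a minimum of 0 in the summary above) needs those values handled before it is transformed.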
The distribution of chlorides has a long tail to the right, i.e. it is right-skewed, so I am setting a limit on the x-axis.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
The distributions of both free sulfur dioxide and total sulfur dioxide have a long tail to the right, so I am discarding the outliers by setting a limit again.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
With a log10 transformation, the distribution seems to be close to normal:
The values for density are between 0.987 and 1.039, so they are on a very small scale. I am setting the binwidth to 0.0005. The distribution looks close to a normal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
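One way to get bins of exactly 0.0005 in base R is to pass explicit break points. This is a sketch on synthetic values; the name `dens` is an assumption (chosen to avoid masking the base R function `density()`), and the mean and standard deviation are made up.

```r
set.seed(7)

# Synthetic stand-in for the density column (NOT the real data).
dens <- rnorm(1000, mean = 0.994, sd = 0.003)

binwidth <- 0.0005

# Align the outermost breaks to multiples of the bin width
# so that every value falls inside a bin.
lower  <- floor(min(dens) / binwidth) * binwidth
upper  <- ceiling(max(dens) / binwidth) * binwidth
breaks <- seq(lower, upper, by = binwidth)

h <- hist(dens, breaks = breaks,
          main = "Density with a bin width of 0.0005",
          xlab = "density (g/cm^3)")
```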
The distribution looks scattered, so I will try to get more information with a log10 transformation:
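For this variable, the log10 transformation does not provide us with more information. I assume the reason is that the values span a very narrow range (0.987 to 1.039), over which log10 is almost linear, so the shape of the distribution barely changes.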
The pH values are approximately normally distributed. No axis limits had to be set, so no outliers had to be removed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
The distribution of sulphates is slightly right-skewed, with most of the values on the left side. I will set a limit.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
The distribution of alcohol is slightly right-skewed (the bulk of the values lies on the left) and widely spread. There are no outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
We are taking a look at the quality. As this variable was transformed to a factor, we use a barplot.
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
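The counts above can be turned into a bar plot directly. The sketch below re-enters the seven counts from the table by hand; the vector name `quality_counts` is an assumption.

```r
# Counts per quality grade, copied from the table above.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Sanity check: the counts add up to the number of observations.
sum(quality_counts)  # 4898

# Share of each grade in percent, rounded to one decimal place.
round(100 * quality_counts / sum(quality_counts), 1)

barplot(quality_counts,
        main = "Number of wines per quality grade",
        xlab = "quality", ylab = "count")
```

With the full data frame available, `barplot(table(wine$quality_factor))` would produce the same plot, assuming the data frame is called `wine`.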
The dataset is related to Portuguese wine. It has 4898 objects with 12 attributes (11 input variables and one output variable).
Input variables (based on physicochemical tests):
fixed acidity (tartaric acid - g / dm^3)
volatile acidity (acetic acid - g / dm^3)
citric acid (g / dm^3)
residual sugar (g / dm^3)
chlorides (sodium chloride - g / dm^3)
free sulfur dioxide (mg / dm^3)
total sulfur dioxide (mg / dm^3)
density (g / cm^3)
pH
sulphates (potassium sulphate - g / dm^3)
alcohol (% by volume)
Output variable (based on sensory data):
quality (score between 0 and 10)
In the Univariate Plots section, I have created plots for each variable, but I haven’t found anything very surprising. Most of the distributions look roughly normal. I cannot identify any correlations yet, but this will be done in the next section.
Based on the description given, the main interest is the quality of wine. The quality was graded by experts, so it will be interesting to find out what actually influenced their grading or rather if there is a variable that influences the grade in either a positive or a negative direction.
With 11 input variables based on physicochemical tests and one output variable based on sensory data, there is enough to investigate.
I created the variable quality_factor to represent the quality in a factor format. This makes it possible to create barplots for this variable and also makes it easier to use it in the plots following in the next sections.
There was no need to tidy the dataset, as there were no missing values.
I set several limits on the x-axis of the plots I created to avoid showing single outlier values. I do not know whether they were caused by measuring errors or whether some wines really have these extremely high or low values.
Something else I noticed is the distribution of the quality levels. There are no wines with a grade less than 3 and no wines with a score greater than 9. Most of them were graded as 6.
The aim of this section is to find out what influences the quality. I will start with a correlation matrix in order to check all variables.
The correlation plot above shows that quality does not correlate with many of the other variables. There is a moderate positive correlation between quality and alcohol (0.43) and a moderate negative correlation between quality and density (-0.31).
The strongest correlations are between density and residual.sugar (0.84) and between density and alcohol (-0.78).
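A correlation matrix like the one discussed here can be computed with `cor()`. The sketch below uses simulated columns, so the coefficients it prints are not the ones reported above; the data frame name `wine_sim` and all simulation parameters are assumptions chosen only to reproduce the signs of the relationships.

```r
set.seed(123)
n <- 500

# Simulated stand-ins (NOT the real data): density is built to rise with
# residual sugar and to fall with alcohol, plus some noise.
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
sugar   <- rexp(n, rate = 1 / 6)
dens    <- 0.994 - 0.002 * (alcohol - 10.5) + 0.0004 * (sugar - 6) +
           rnorm(n, mean = 0, sd = 0.0008)

wine_sim <- data.frame(alcohol = alcohol,
                       residual.sugar = sugar,
                       density = dens)

# Pearson correlation between every pair of numeric columns.
cor_mat <- cor(wine_sim)
round(cor_mat, 2)
```

`cor()` uses the Pearson coefficient by default. It only accepts numeric columns, so a factor such as quality_factor would have to be left out of the data frame passed to it.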
The first plot shows that there is a moderate positive correlation between quality and alcohol (0.43). However, the boxplot gives us information that we couldn’t see before: from quality level 3 to 4 and from 4 to 5, the median alcohol level drops. This runs against the correlation we found before, so we have to find out what is going on here.
##
##           3   4   5   6   7   8   9
##   0.987   0   0   2   0   5   1   0
##   0.988   0   0   0   9   8   0   0
##   0.989   0   2   4  70  66  24   0
##   0.99    0   8  31 170 132  28   3
##   0.991   2  16  40 241 185  31   1
##   0.992   3  13 154 284 156  26   0
##   0.993   2  25 166 290  88  21   0
##   0.994   3  27 203 235  67  15   0
##   0.995   2  25 166 247  49   9   0
##   0.996   1  16 219 190  23   3   0
##   0.997   3   8 144 160  17   0   1
##   0.998   2  17 169 155  42   8   0
##   0.999   0   4  80  95  22   6   0
##   1       2   2  65  37  20   1   0
##   1.001   0   0  11   7   0   2   0
##   1.002   0   0   3   3   0   0   0
##   1.003   0   0   0   2   0   0   0
##   1.01    0   0   0   2   0   0   0
##   1.039   0   0   0   1   0   0   0
This table gives us an explanation for this finding: there are relatively few wines graded 3 or 4. For this reason, these grades have less influence on the correlation coefficient.
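A cross-tabulation of this kind, with density rounded to three decimals in the rows and the quality grade in the columns, might be produced as sketched below. The values are simulated; only the sampling weights are taken from the quality counts reported earlier. The names `dens`, `quality` and `tab` are assumptions.

```r
set.seed(99)
n <- 300

# Simulated grades (NOT the real data), drawn with the observed grade
# frequencies as sampling weights, and simulated density values.
quality <- sample(3:9, size = n, replace = TRUE,
                  prob = c(20, 163, 1457, 2198, 880, 175, 5))
dens <- rnorm(n, mean = 0.994, sd = 0.003)

# Rows: density rounded to 3 decimals. Columns: quality grade.
tab <- table(density = round(dens, 3),
             quality = factor(quality, levels = 3:9))
tab

# Column totals show how few observations the extreme grades contribute.
colSums(tab)
```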
Both plots show the negative correlation between quality and density. Similar to the previous plots, this is not very obvious for the lower quality levels. The level 5 wines even have the highest median for density.
The plot shows a trend: wines with a low density tend to have more alcohol.
As I decided to focus on the quality variable I paid special attention to the correlation between quality and other variables. There is a moderate positive correlation with alcohol (0.43) and a moderate negative correlation with density (-0.31).
As the correlation matrix indicates, most of the variables do not have a strong correlation with each other. For that reason, I did not take a closer look at them.
The strongest correlations I found are between density and residual.sugar (0.84) and between density and alcohol (-0.78).
This is a plot for the two variables that correlate the most with alcohol. We can see that the points get a lighter colour the higher the density is and the lower the alcohol value.
This plot represents the two variables with the highest correlation. I have coloured the points by factorizing the rounded value of alcohol. You can clearly see how the colours change across the plot.
This plot shows the quantities of each quality factor for each level of alcohol. You can see that the blue part is growing from the left to the right. Only the leftmost bar has a distinct red part, which represents the lowest level.
The plots confirm the correlation I had found before.
Looking at the colours of both plots you can see the different correlations: while the first plot represents only moderate correlations, the second has a clearer colouring. Furthermore, it illustrates the difference between a positive and a negative correlation. While the points in the first plot (for density and residual sugar) are aligned on an axis from the lower left to the upper right corner, the second plot (for density and alcohol) is from the upper left to the lower right corner.
I think it is quite interesting to factorize values in order to use them as a category for plots. While this might not be a good solution for some values, it seems to be a suitable solution for the amount of alcohol.
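Factorizing a rounded value could look like the following sketch. The ten alcohol values are the first ten shown in the str() output at the top of this report; the name `alcohol_level` is an assumption.

```r
# First ten alcohol values from the str() output (illustrative subset).
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)

# Round to whole percent and treat the result as a category.
alcohol_level <- factor(round(alcohol))

levels(alcohol_level)
table(alcohol_level)
```

One detail to keep in mind: `round()` in R rounds exact halves to the even neighbour, so 9.5 becomes 10 but 10.5 would also become 10. If fixed class boundaries are preferred, `cut()` with explicit breaks is an alternative.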
This scatterplot shows the variable residual.sugar on the x-axis and the variable density on the y-axis. The points are coloured by the rounded amount of alcohol. The dominating colour clearly changes from purple in the lower left of the plot (wines with low density and low residual sugar) to blue in the upper right. This means that the amount of alcohol is decreasing.
For me, this is the most surprising plot in this report. While there is a correlation between quality and alcohol (0.43), it’s not possible to see this correlation in the box plot. One could intuitively think that the boxes should be higher with each quality level as these variables correlate positively. This is not true for levels 3, 4 and 5. On the contrary, the boxes even get lower across these three levels.
As I have explained in the section above, there are relatively few wines graded 3 or 4, so this plot is surprising, but it can be explained. However, it shows that it is often not enough to present data with only one plot. Showing only this boxplot could have led to the wrong impression that wine with a low density has a middle quality level.
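The effect can be reproduced with a small constructed example. The group sizes are the quality counts reported earlier; the alcohol centres per grade are hypothetical values chosen only to mimic the pattern in the box plot (falling from grade 3 to 5, rising afterwards), and the spread within each grade is artificial.

```r
# Group sizes: the quality counts from the univariate section.
sizes <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
           "7" = 880, "8" = 175, "9" = 5)

# HYPOTHETICAL alcohol centres per grade, not measured values.
centres <- c("3" = 10.4, "4" = 10.1, "5" = 9.5, "6" = 10.5,
             "7" = 11.4, "8" = 12.0, "9" = 12.5)

quality <- rep(3:9, times = sizes)

# Artificial alcohol values: evenly spread within +/- 1 around each centre,
# so each group median equals its centre.
alcohol <- unlist(lapply(names(sizes), function(g) {
  centres[[g]] + seq(-1, 1, length.out = sizes[[g]])
}))

group_medians <- tapply(alcohol, quality, median)
overall_cor   <- cor(quality, alcohol)

group_medians
overall_cor
```

In this construction the medians fall from grade 3 to grade 5, yet the overall correlation is positive, because grades 3 and 4 together hold only 183 of the 4898 observations and therefore carry little weight in the coefficient.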
In the analysis of the data set, I found two variables that are moderately correlated with the quality of wine (alcohol and density), and I took a closer look at the two pairs of variables in the dataset that had the highest correlation overall. The quality is an interesting value to look at because it is the only subjective one. So we can find out which measured values influence how experts evaluate the quality.
The data set was clean and the documentation was really good, so I did not run into any major problems. I have worked a lot with R in the past and have generated many plots for my own research. I would not consider myself a professional, but I know what is possible and where I can find answers to my questions. However, I was not really happy that there were no high correlations with quality in the data set, which would have made the analysis a bit smoother and would have given me more options to create interesting plots.
For future work, it would be interesting to create a model to predict the quality of wine from the measured values. It would be really interesting to find out which values a “perfect wine” would have. If we conducted this research for a wine-selling company, we could even try to create such a wine and let it be graded by the experts. I am not sure if it is possible to create a reliable model, as some of the values have a very low correlation with quality, so they are unlikely to be useful for a prediction. Furthermore, as already mentioned above, the quality scores do not give a lot of information, as only 7 grades were assigned in this data set.